reward estimator
- North America > United States > California (0.14)
- North America > United States > Iowa (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Europe > Middle East > Malta (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Singapore (0.04)
- Asia > India > Maharashtra > Mumbai (0.04)
- Information Technology > Data Science > Data Mining > Big Data (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
Provably Efficient Interactive-Grounded Learning with Personalized Reward
Zhang, Mengxiao, Zhang, Yuheng, Luo, Haipeng, Mineiro, Paul
Interactive-Grounded Learning (IGL) [Xie et al., 2021] is a powerful framework in which a learner aims at maximizing unobservable rewards through interacting with an environment and observing reward-dependent feedback on the taken actions. To deal with personalized rewards that are ubiquitous in applications such as recommendation systems, Maghakian et al. [2022] study a version of IGL with context-dependent feedback, but their algorithm does not come with theoretical guarantees. In this work, we consider the same problem and provide the first provably efficient algorithms with sublinear regret under realizability. Our analysis reveals that the step-function estimator of prior work can deviate uncontrollably due to finite-sample effects. Our solution is a novel Lipschitz reward estimator which underestimates the true reward and enjoys favorable generalization performance. Building on this estimator, we propose two algorithms, one based on explore-then-exploit and the other based on inverse-gap weighting. We apply IGL to learning from image feedback and learning from text feedback, which are reward-free settings that arise in practice. Experimental results showcase the importance of using our Lipschitz reward estimator and the overall effectiveness of our algorithms.
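As context for the inverse-gap-weighting variant mentioned in the abstract, the sketch below shows the standard inverse-gap-weighting action distribution from the contextual-bandit literature; the predictor output `f_hat` and the exploration parameter `gamma` are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def igw_distribution(f_hat, gamma):
    """Standard inverse-gap-weighting distribution over K actions.

    f_hat : array of predicted rewards, one per action (illustrative predictor output).
    gamma : exploration parameter; larger gamma concentrates mass on the greedy action.
    """
    K = len(f_hat)
    best = int(np.argmax(f_hat))
    p = np.zeros(K)
    for a in range(K):
        if a != best:
            # Probability inversely proportional to the estimated reward gap.
            p[a] = 1.0 / (K + gamma * (f_hat[best] - f_hat[a]))
    p[best] = 1.0 - p.sum()  # remaining mass goes to the empirically best action
    return p

# Example: sample an action from the induced distribution.
rng = np.random.default_rng(0)
probs = igw_distribution(np.array([0.2, 0.5, 0.1]), gamma=10.0)
action = rng.choice(len(probs), p=probs)
```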
- North America > United States > California (0.14)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Europe > Middle East > Malta > Port Region > Southern Harbour District > Valletta (0.04)
Multi-Fidelity Reinforcement Learning for Time-Optimal Quadrotor Re-planning
Ryou, Gilhyun, Wang, Geoffrey, Karaman, Sertac
High-speed online trajectory planning for UAVs poses a significant challenge due to the need for precise modeling of complex dynamics while also being constrained by computational limitations. This paper presents a multi-fidelity reinforcement learning method (MFRL) that aims to effectively create a realistic dynamics model and simultaneously train a planning policy that can be readily deployed in real-time applications. The proposed method involves the co-training of a planning policy and a reward estimator; the latter predicts the performance of the policy's output and is trained efficiently through multi-fidelity Bayesian optimization. This optimization approach models the correlation between different fidelity levels, thereby constructing a high-fidelity model based on a low-fidelity foundation, which enables the accurate development of the reward model with limited high-fidelity experiments. The framework is further extended to include real-world flight experiments in reinforcement learning training, allowing the reward model to precisely reflect real-world constraints and broadening the policy's applicability to real-world scenarios. We present rigorous evaluations by training and testing the planning policy in both simulated and real-world environments. The resulting trained policy not only generates faster and more reliable trajectories compared to the baseline snap minimization method, but it also achieves trajectory updates in 2 ms on average, while the baseline method takes several minutes.
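As a rough, generic illustration of the multi-fidelity idea (not the authors' implementation), one common construction models the high-fidelity reward as a cheap low-fidelity surrogate plus a residual learned from a handful of expensive evaluations; the scikit-learn sketch below uses synthetic, hypothetical data.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical data: many cheap low-fidelity rollouts, few expensive high-fidelity ones.
X_lo, y_lo = np.random.rand(200, 3), np.random.rand(200)
X_hi, y_hi = np.random.rand(10, 3), np.random.rand(10)

# Fit the low-fidelity reward surrogate on abundant cheap data.
gp_lo = GaussianProcessRegressor().fit(X_lo, y_lo)

# Fit a residual model on the few high-fidelity points:
# high-fidelity reward ~ low-fidelity prediction + learned correction.
residual = y_hi - gp_lo.predict(X_hi)
gp_res = GaussianProcessRegressor().fit(X_hi, residual)

def predict_high_fidelity(X):
    """Correct the cheap surrogate with the residual learned from expensive data."""
    return gp_lo.predict(X) + gp_res.predict(X)
```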
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- North America > United States > Virginia (0.04)
- (2 more...)
Why Guided Dialog Policy Learning performs well? Understanding the role of adversarial learning and its alternative
Shimoyama, Sho, Morimura, Tetsuro, Abe, Kenshi, Takamichi, Toda, Tomomatsu, Yuta, Sugiyama, Masakazu, Hentona, Asahi, Azuma, Yuuki, Ninomiya, Hirotaka
Dialog policies, which determine a system's action based on the current state at each dialog turn, are crucial to the success of the dialog. In recent years, reinforcement learning (RL) has emerged as a promising option for dialog policy learning (DPL). In RL-based DPL, dialog policies are updated according to rewards. The manual construction of fine-grained rewards, such as state-action-based ones, to effectively guide the dialog policy is challenging in multi-domain task-oriented dialog scenarios with numerous state-action pair combinations. One way to estimate rewards from collected data is to train the reward estimator and dialog policy simultaneously using adversarial learning (AL). Although this method has demonstrated superior performance experimentally, it is fraught with the inherent problems of AL, such as mode collapse. This paper first identifies the role of AL in DPL through detailed analyses of the objective functions of dialog policy and reward estimator. Next, based on these analyses, we propose a method that eliminates AL from reward estimation and DPL while retaining its advantages. We evaluate our method using MultiWOZ, a multi-domain task-oriented dialog corpus.
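For readers unfamiliar with the adversarial setup being analyzed, a GAIL-style arrangement trains a discriminator to distinguish corpus (state, action) pairs from policy-generated ones and uses its confidence as the estimated reward; the PyTorch sketch below is a generic illustration with hypothetical dimensions, not the authors' code.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 64, 16  # hypothetical sizes

# Discriminator scores concatenated (state, action) vectors: corpus data vs. policy data.
disc = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(expert_sa, policy_sa):
    """One adversarial update: label corpus pairs 1, policy-generated pairs 0."""
    logits_e = disc(expert_sa)
    logits_p = disc(policy_sa)
    loss = bce(logits_e, torch.ones_like(logits_e)) + bce(logits_p, torch.zeros_like(logits_p))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def estimated_reward(sa):
    """Use the discriminator's confidence as a learned reward for the dialog policy."""
    with torch.no_grad():
        return torch.log(torch.sigmoid(disc(sa)) + 1e-8)
```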
A Domain-Independent Agent Architecture for Adaptive Operation in Evolving Open Worlds
Mohan, Shiwali, Piotrowski, Wiktor, Stern, Roni, Grover, Sachin, Kim, Sookyung, Le, Jacob, De Kleer, Johan
Model-based reasoning agents are ill-equipped to act in novel situations in which their model of the environment no longer sufficiently represents the world. We propose HYDRA - a framework for designing model-based agents operating in mixed discrete-continuous worlds that can autonomously detect when the environment has evolved from its canonical setup, understand how it has evolved, and adapt the agent's models to perform effectively. HYDRA is based upon PDDL+, a rich modeling language for planning in mixed, discrete-continuous environments. It augments the planning module with visual reasoning, task selection, and action execution modules for closed-loop interaction with complex environments. HYDRA implements a novel meta-reasoning process that enables the agent to monitor its own behavior across a variety of aspects. The process employs a diverse set of computational methods to maintain expectations about the agent's own behavior in an environment. Divergences from those expectations are useful in detecting when the environment has evolved and identifying opportunities to adapt the underlying models. HYDRA builds upon ideas from diagnosis and repair and uses a heuristics-guided search over model changes such that the agent becomes competent in novel conditions. The HYDRA framework has been used to implement novelty-aware agents for three diverse domains - CartPole++ (a higher-dimensional variant of a classic control problem), Science Birds (an IJCAI competition problem), and PogoStick (a specific problem domain in Minecraft). We report empirical observations from these domains to demonstrate the efficacy of various components in the novelty meta-reasoning process.
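One of the expectation checks described above can be illustrated with a minimal state-prediction divergence test; the function below is a hypothetical sketch (names and threshold are illustrative), not HYDRA's actual monitoring code.

```python
import numpy as np

def novelty_detected(predicted_states, observed_states, threshold=0.5):
    """Flag novelty when observed behavior diverges from the model's expectations.

    predicted_states / observed_states : arrays of shape (T, state_dim)
    threshold : illustrative tolerance on the mean per-step prediction error.
    """
    errors = np.linalg.norm(np.asarray(predicted_states) - np.asarray(observed_states), axis=1)
    return errors.mean() > threshold
```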
- Africa > South Sudan > Equatoria > Central Equatoria > Juba (0.04)
- North America > United States > Oklahoma > Payne County > Cushing (0.04)
- Europe > Slovenia > Central Slovenia > Municipality of Komenda > Komenda (0.04)
- (2 more...)
- Leisure & Entertainment > Games > Computer Games (0.66)
- Media > Television (0.46)
- Government > Regional Government > North America Government > United States Government (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- (2 more...)
Fast Offline Policy Optimization for Large Scale Recommendation
Sakhi, Otmane, Rohde, David, Gilotte, Alexandre
Personalised interactive systems such as recommender systems require selecting relevant items from massive catalogues, dependent on context. Reward-driven offline optimisation of these systems can be achieved by a relaxation of the discrete problem resulting in policy learning or REINFORCE-style learning algorithms. Unfortunately, this relaxation step requires computing a sum over the entire catalogue, making the complexity of the evaluation of the gradient (and hence each stochastic gradient descent iteration) linear in the catalogue size. This calculation is untenable in many real-world examples such as large catalogue recommender systems, severely limiting the usefulness of this method in practice. In this paper, we derive an approximation of these policy learning algorithms that scales logarithmically with the catalogue size. Our contribution is based upon combining three novel ideas: a new Monte Carlo estimate of the gradient of a policy, the self-normalised importance sampling estimator, and the use of fast maximum inner product search at training time. Extensive experiments show that our algorithm is an order of magnitude faster than naive approaches yet produces equally good policies.
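To make the self-normalised importance sampling idea concrete, the sketch below builds a surrogate objective whose gradient approximates the log-softmax policy gradient using only a subset of sampled catalogue items; it is a generic illustration (the maximum inner product search step that would choose those samples is omitted), not the paper's exact estimator.

```python
import torch

def snis_softmax_surrogate(scores_pos, scores_neg, logq_neg):
    """Surrogate objective whose gradient approximates grad log softmax(pos | full catalogue).

    scores_pos : score of the chosen item, shape (batch,)
    scores_neg : scores of sampled catalogue items, shape (batch, n_samples)
    logq_neg   : log proposal probabilities of those samples, shape (batch, n_samples)
    """
    # Self-normalised importance weights approximating the softmax over the whole catalogue.
    logw = scores_neg - logq_neg
    w = torch.softmax(logw, dim=-1).detach()  # stop-gradient on the weights
    # Gradient of this surrogate = grad(score_pos) - sum_i w_i * grad(score_neg_i),
    # which mirrors the exact log-softmax gradient without summing over every item.
    return (scores_pos - (w * scores_neg).sum(dim=-1)).mean()
```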
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Virginia > Arlington County > Arlington (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (3 more...)
Finite-Time Analysis of Fully Decentralized Single-Timescale Actor-Critic
Decentralized Actor-Critic (AC) algorithms have been widely utilized for multi-agent reinforcement learning (MARL) and have achieved remarkable success. Apart from this empirical success, the theoretical convergence properties of decentralized AC algorithms remain largely unexplored. Most of the existing finite-time convergence results are derived based on either a double-loop update or a two-timescale step-size rule, and this is the case even for the centralized AC algorithm in the single-agent setting. In practice, the \emph{single-timescale} update is widely used, where the actor and critic are updated in an alternating manner with step sizes of the same order. In this work, we study a decentralized \emph{single-timescale} AC algorithm. Theoretically, using linear function approximation for value and reward estimation, we show that the algorithm has sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-2})$ under Markovian sampling, which matches the optimal complexity with a double-loop implementation (here, $\tilde{\mathcal{O}}$ hides a logarithmic term). When we reduce to the single-agent setting, our result yields a new sample complexity bound for centralized AC using a single-timescale update scheme. Central to establishing our complexity results is \emph{the hidden smoothness of the optimal critic variable} that we reveal. We also provide a local action privacy-preserving version of our algorithm and its analysis. Finally, we conduct experiments to show the superiority of our algorithm over the existing decentralized AC algorithms.
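For intuition, a single-agent, single-timescale step with a linear critic alternates a TD(0) critic update and a policy-gradient actor update with step sizes of the same order; the sketch below is illustrative only and omits the decentralized communication and privacy-preserving components discussed in the abstract.

```python
import numpy as np

def single_timescale_ac_step(w, theta, phi_s, phi_s_next, grad_logpi, r,
                             gamma=0.99, alpha=0.01, beta=0.01):
    """One alternating actor-critic update with same-order step sizes (single-agent sketch).

    w          : linear critic weights, V(s) = w @ phi(s)
    theta      : policy parameters
    phi_s, phi_s_next : feature vectors of the current / next state
    grad_logpi : gradient of log pi(a|s) w.r.t. theta for the taken action
    r          : observed reward
    """
    td_error = r + gamma * w @ phi_s_next - w @ phi_s
    w_new = w + beta * td_error * phi_s                 # critic: semi-gradient TD(0)
    theta_new = theta + alpha * td_error * grad_logpi   # actor: policy gradient with TD error
    return w_new, theta_new
```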